-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make HaplotypeCallerSpark extend AssemblyRegionWalkerSpark #5386
Make HaplotypeCallerSpark extend AssemblyRegionWalkerSpark #5386
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5386 +/- ##
==============================================
+ Coverage 86.985% 87.025% +0.04%
- Complexity 30423 30439 +16
==============================================
Files 1850 1852 +2
Lines 140944 140936 -8
Branches 15505 15508 +3
==============================================
+ Hits 122600 122650 +50
+ Misses 12705 12647 -58
Partials 5639 5639
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a whole lot to say about this branch. I think it is a good change to fold the haplotype caller into an assembly region walker, though we may end up doing some extensive work on the walker at some point .
new FeatureContext(features, assemblyRegion.getExtendedSpan()))).iterator(); | ||
AssemblyRegionEvaluator assemblyRegionEvaluator = supplierBroadcast.getValue().get(); // one AssemblyRegionEvaluator instance per Spark partition | ||
final ReadsDownsampler readsDownsampler = assemblyRegionArgs.maxReadsPerAlignmentStart > 0 ? | ||
new PositionalDownsampler(assemblyRegionArgs.maxReadsPerAlignmentStart, header) : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the downsampler want to be configurable at this stage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that's needed yet since the examples and HaplotypeCaller both use the PositionalDownsampler. There would be a bit of work to ensure the downsampler is serializable.
new DownsampleableSparkReadShard( | ||
new ShardBoundary(shardedRead.getInterval(), shardedRead.getPaddedInterval()), shardedRead, readsDownsampler))) | ||
.map(shardedRead -> { | ||
final Iterator<AssemblyRegion> assemblyRegionIter = new AssemblyRegionIterator( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential optimization we may want to look at for the haplotype caller is being able to generate these assembly regions and reconstruct the constituent up the reads later to save on holding an entire bam in an rdd at a time. That doesn't seem easy to do given the way this class is structured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got a branch based on this one that does exactly that :-)
|
||
@Override | ||
protected AssemblyRegionArgumentCollection getAssemblyRegionArgumentCollection() { | ||
return new ExampleAssemblyRegionWalkerSpark.ExampleAssemblyRegionArgumentCollection(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these arguments getting pulled in from the example tool? Why can't this be its own argument collection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a mistake - I've fixed it now.
a1730f7
to
a7caab2
Compare
…ll allowing ReadsPipelineSpark to share common code.
a7caab2
to
9d1c3b5
Compare
Thanks for the review @jamesemery. I have tried these changes on a cluster with an exome-sized input, with no performance issues, so I'd be comfortable merging. As you say it will be a useful basis for the HaplotypeCallerSpark work we are doing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch looks more or less good to me. Considering that these changes are temporary before your future changes to this machinery I would say this could be approved 👍
…titute#5386) * Factor out command line options for assembly region walkers (Spark) * Make HaplotypeCallerSpark extend AssemblyRegionWalkerSpark, while still allowing ReadsPipelineSpark to share common code.
This brings HaplotypeCallerSpark into closer alignment with the walker version.
cc @jonn-smith @jamesemery @droazen